Abstract

Research  in  cluster  analysis  has  resulted  in  a  large  number  of  algorithms  and  similarity 
measurements  for  clustering  scientific  data.  Machine  learning  researchers  have  published  a 
number  of  methods  for  conceptual  clustering,  in  which  observations  are  grouped  into  clusters 
which  have  “good”  descriptions  in  some  language.  We  investigate  the  general  properties 
which  similarity  metrics,  objective  functions,  and  concept  description  languages  must  have 
to  guarantee  that  a  (conceptual)  clustering  problem  is  polynomial  time  solvable  by  a  simple 
and  widely-used  clustering  technique,  the  agglomerative-hierarchical  algorithm.  We  show 
that  under  fairly  general  conditions,  the  agglomerative-hierarchical  method  may  be  used  to 
find  an  optimal  solution  in  polynomial  time. 


Keywords:  Cluster  Analysis,  Conceptual  Clustering,  Analysis  of  Algorithms. 


1  Introduction 


There  is  a  wide  body  of  literature  in  several  fields  concerned  with  the  clustering  problem. 
Roughly,  this  is  the  problem  of  how  to  group  observations  into  categories  such  that  members 
of  a  category  are  alike  in  some  interesting  way  and  members  of  different  categories  are 
different.  Within  artificial  intelligence  and  machine  learning,  the  clustering  problem  has 
been  classified  as  part  of  the  general  problem  of  learning  from  observation  and  discovery 
(Carbonell,  Michalski,  &  Mitchell,  1983). 

Much of the work on the clustering problem has involved numerical or statistical techniques
for clustering scientific data. Researchers in cluster analysis and numerical taxonomy
have  focussed  on  developing  appropriate  metrics  for  measuring  similarity  between  points  and 
clusters  (groups  of  points),  and  on  developing  algorithms  to  minimize  inter-cluster  similarity 
as  measured  by  some  objective  function.  The  literature  on  these  techniques  is  scattered 
through  journals  in  statistics,  pattern  recognition,  computer  science,  and  various  fields  of 
application  (biology,  psychology,  sociology,  etc.).  Summaries  may  be  found  in  (Anderberg, 
1973;  Hartigan,  1975;  Duda  &  Hart,  1973),  and  more  recently  (Romesburg,  1984). 

In machine learning, work on the clustering problem has focussed on the notion of conceptual
clustering, introduced by Michalski (1980). Conceptual clustering methods attempt not
only  to  produce  “good”  classifications  based  on  some  metric,  but  also  to  find  a  meaningful 
description  of  the  classification.  In  contrast,  cluster  analysis  techniques  leave  it  to  the  human 
analyst  to  determine  the  meaning  of  a  clustering  (here,  and  throughout  the  remainder  of 
the  paper,  we  use  the  term  “cluster  analysis”  to  refer  to  all  clustering  techniques  in  which 
the  quality  of  cluster  descriptions  is  not  a  factor  in  measuring  the  quality  of  the  clustering). 
Researchers  in  conceptual  clustering  have  not  only  produced  a  number  of  algorithms  and 
metrics (e.g., Lebowitz, 1983; Michalski & Stepp, 1983; Fisher, 1985; Mogenson, 1987), but
have also investigated employing clustering in problem solving (Rendell, 1983; Fisher, 1987),
incorporating problem-specific knowledge into the clustering process (Mogenson, 1987; Stepp
& Michalski, 1986), and providing appropriate cluster description languages (Stepp, 1984;
Fisher,  1985).  Fisher  and  Langley  (1985)  and  Stepp  (1987b)  provide  overviews  of  work  on 
conceptual  clustering,  and  develop  characterizations  of  the  problem. 

Instead  of  adding  to  the  already  large  body  of  clustering  algorithms  and  metrics,  we 
have  explored  the  properties  of  a  simple,  general,  and  well-known  clustering  algorithm,  the 
agglomerative-hierarchical or amalgamative algorithm. In particular, we are interested in
characterizing  the  kinds  of  problems,  metrics,  objective  functions  and  concept  description 
languages  on  which  a  given  algorithm  will  succeed.  We  define  success  as  the  discovery  of 
an  optimal  clustering  in  time  polynomial  in  the  number  of  data  points  to  be  clustered. 
Though  introductory  texts  on  cluster  analysis  sometimes  attempt  to  explain  how  to  choose 
an  appropriate  algorithm  and  metric,  there  has  not,  to  our  knowledge,  been  any  formal 
exploration  of  the  conditions  under  which  particular  algorithms  are  guaranteed  to  produce 
optimal  solutions  in  polynomial  time  (there  are,  of  course,  algorithms  designed  to  produce 
an  optimal  solution  for  particular  metrics  and  objective  functions).  By  specifying  such 
conditions,  we  hope  to  simplify  the  problem  of  choosing  an  appropriate  clustering  technique. 

In  the  next  section,  we  introduce  a  simple  conceptual  clustering  problem,  and  use  it  to 
motivate general definitions for clustering problems, distance metrics, and objective functions.
In Section 3, the example is used to introduce the agglomerative algorithm, and to
motivate some fairly general restrictions on conceptual clustering problems; in Section 4,
these restrictions are proved sufficient to guarantee that the algorithm finds an optimal solution.
We also examine further properties of the agglomerative algorithm, as well as variations
of  the  restrictions  of  Section  3. 


2  Definitions 

We  formally  define  a  very  general  version  of  the  clustering  problem.  The  definitions  are 
similar  in  spirit  to  the  definition  of  the  Abstract  Clustering  Task  given  by  Fisher  and  Langley 
(1985).  However,  we  want  to  more  carefully  specify  what  is  meant  by  a  “clustering  quality 
function.”  We  believe  the  definitions  are  general  enough  to  subsume  the  objective  functions 
normally  used  in  cluster  analysis  and  conceptual  clustering. 

We  motivate  the  definitions,  and  the  properties  of  the  next  section,  with  the  following 
example of simplified conjunctive conceptual clustering (Michalski & Stepp, 1983), or monomial
clustering. The formal definition of the monomial clustering problem will be given in
Section  3.1. 

For monomial clustering, the objects to be clustered are described with $n$ boolean attributes
$x_1, x_2, \ldots, x_n$. Let $X_n$ be the set of boolean vectors over the attributes $x_1, x_2, \ldots, x_n$.
Then the domain for the monomial clustering problem is $\mathcal{X} = \{X_n\}_{n \ge 1}$.
Similarly, many other domains for clustering problems are best described as a parameterized
family $\mathcal{X} = \{X_n\}$ of sets, where the parameter $n$ is some appropriate value, typically
reflecting the “size” of an element. For example, if we are interested in clustering objects
in Euclidean space, then the domain $\mathcal{X}$ might be the collection $\{E^n\}_{n \ge 1}$, where $E^n$ is
$n$-dimensional Euclidean space.

For  some  domains  X  =  {Xn},  we  may  allow  Xn  to  be  empty  for  some  (or  most,  if  no 
parameterization  is  desired)  values  of  n.  For  example,  if  the  only  type  of  clustering  problem 
that  we  wish  to  consider  is  clustering  in  2-dimensional  Euclidean  space,  then  we  would  let 
$X_2 = E^2$, and $X_i = \emptyset$ for $i \ne 2$.

The above considerations motivate the following definition.

Definition 2.1 A domain $\mathcal{X}$ is a parameterized family of sets $\{X_n\}_{n \ge 1}$.

Our  definitions  will  require  that  a  clustering  algorithm  work  for  all  Xn  that  are  nonempty, 
and  that  the  algorithm  have  run-time  polynomial  in  n. 

In  conceptual  clustering,  a  cluster  is  described  by  a  statement  in  some  language,  and 
not  by  the  set  of  points  in  the  cluster.  For  example,  for  monomial  clustering,  clusters  are 
described by monomials over $n$ boolean attributes. Let $L_n$ be the set of all monomials (pure
conjunctive concepts) over the boolean variables $x_1, x_2, \ldots, x_n$.

Definition 2.2 A clustering description language $\mathcal{L}$ is a parameterized family of languages
$\{L_n\}_{n \ge 1}$.

The parameter $n$ typically reflects the size or length of statements in the language $L_n$.
Each statement $c \in L_n$ is a cluster. The meaning of a cluster is given by an interpretation:

Definition 2.3 An interpretation $\mathcal{I} = \{I_n\}_{n \ge 1}$ of a language $\mathcal{L}$ to a domain $\mathcal{X}$ is a parameterized
family of functions $I_n : L_n \to 2^{X_n}$. ($2^{X_n}$ is the power set of $X_n$.)

For each $n$, every cluster $c \in L_n$ describes a set of points of $X_n$ given by $I_n(c)$.

For monomial clustering, the interpretation $I_n$ of a monomial over the variables $x_1, x_2,
\ldots, x_n$ is the standard logical one, i.e., the set of boolean vectors of length $n$ (over the same
variables) that satisfy the monomial. For some applications, it is desirable to parameterize
the family of languages $\mathcal{L}$ differently than the domain $\mathcal{X}$. For example, if the goal is to
cluster numerical data points based on their descriptions in a simple, limited language, then
for each $n$ and $m$ there would be an interpretation $I_{n,m} : L_n \to 2^{X_m}$. For the sake of clarity,
we will assume that the language and its domain have the same parameter $n$; the extensions
required for the two-parameter case are straightforward.

A  clustering  C  over  Ln  is  a  finite  set  of  statements  (clusters)  of  Ln.  The  size  of  a  clustering 
C  is  the  sum  of  the  lengths  of  the  statements  in  C .  Let  Kn  denote  the  class  of  all  clusterings 
over  Ln.  For  monomials,  a  clustering  is  simply  a  finite  collection  of  monomials. 

A cluster $c$ covers a set $S \subseteq X_n$ iff $S \subseteq I_n(c)$. A clustering $C$ covers a set $S \subseteq X_n$ iff
$S \subseteq \bigcup_{c \in C} I_n(c)$. The clustering $C$ is a prime clustering of a set $S$ iff $C$ covers $S$ and there is
no proper subset $C' \subset C$ such that $C'$ covers $S$. A prime clustering is therefore one that
contains no extraneous clusters.
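The covering and primality definitions can be made concrete for the monomial example. The following is a minimal sketch, not taken from the paper: it assumes a monomial is encoded as a Python dict mapping a 0-based attribute index to the value that attribute must take, and a point as a tuple of booleans. All function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[bool, ...]
Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def satisfies(x: Point, m: Monomial) -> bool:
    """True iff point x lies in the interpretation I_n(m)."""
    return all(x[i] == v for i, v in m.items())


def covers(clustering: Sequence[Monomial], points: Sequence[Point]) -> bool:
    """True iff every point is covered by at least one cluster."""
    return all(any(satisfies(x, m) for m in clustering) for x in points)


def is_prime(clustering: Sequence[Monomial], points: Sequence[Point]) -> bool:
    """True iff the clustering covers the points and no proper subset does.

    Removing clusters can only shrink the covered set, so it is enough to
    check that removing any single cluster breaks the cover.
    """
    if not covers(clustering, points):
        return False
    for k in range(len(clustering)):
        rest = list(clustering[:k]) + list(clustering[k + 1:])
        if covers(rest, points):
            return False
    return True


S = [(True, True, False), (False, True, True)]
C = [{0: True, 1: True}, {0: False}]
C_extra = C + [{1: True}]  # the added monomial x_2 already covers both points
```

Here `C` is a prime clustering of `S`, while `C_extra` still covers `S` but is not prime, because it contains clusters that can be dropped without losing the cover.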

The goal of clustering is to find, given a finite subset $S \subseteq X_n$ of elements, a (prime)
clustering that covers $S$ (and possibly other elements of $X_n$) such that each cluster of the
clustering covers similar elements (i.e., is tight), and different clusters cover dissimilar elements
(i.e., have large distance). The definitions of tightness and distance should depend
only on clusters, i.e., on statements in the cluster description language, and not on the
points covered by those clusters, thus fulfilling the primary condition of conceptual clustering
(Michalski & Stepp, 1983; Fisher & Langley, 1985).

The  tightness  function  will  be  parameterized  by  n  in  a  manner  similar  to  domains  and 
clustering  description  languages.  Since  the  clustering  algorithm  must  employ  this  function, 
it  is  unreasonable  to  expect  it  to  work  for  all  n  unless  the  tightness  functions  for  each  n 
are  related  to  the  extent  that  there  is  a  single  algorithm  for  evaluating  them.  For  the  same 
reason,  the  distance  functions  should  be  interrelated  as  well. 

Definition 2.4 A family $\mathcal{F} = \{F_n\}_{n \ge 1}$ of functions is uniformly computable iff there is an
algorithm $F$ such that for all $n$ and $x$, $F(n, x) = F_n(x)$. A family is uniformly polynomial
time computable iff $F$ runs in time polynomial in the value of $n$ and the size of $x$.

The formal definitions of tightness and distance functions are thus:

Definition 2.5 $\mathcal{T} = \{T_n\}_{n \ge 1}$ is a uniformly computable family of tightness functions, with
$T_n : K_n \to \Re^+$, where $\Re^+$ denotes the nonnegative real numbers.

Definition 2.6 $\mathcal{D} = \{D_n\}_{n \ge 1}$ is a uniformly computable family of distance functions, with
$D_n : K_n \to \Re^+$.
Tn is a measure of tightness of clusterings over the language Ln. Note that Tn is a
function  of  clusterings,  and  not  of  individual  clusters.  For  monomial  clustering,  a  natural 
measure  of  the  tightness  of  a  monomial  is  simply  the  number  of  attributes  appearing  in  the 
monomial.  A  natural  measure  of  overall  tightness  of  a  clustering  (set  of  monomials)  might 
be  the  minimum  tightness  of  any  monomial  of  the  set.  We  define  this  tightness  function  T 
for  monomials  in  the  next  section. 

Dn  is  a  measure  of  distance  of  clusterings  over  the  language  Ln.  Note  that  Dn  is  a 
function  of  clusterings,  not  of  pairs  of  clusters.  In  the  next  section,  we  define  the  distance 
Dn  of  a  monomial  clustering  to  be  the  minimum  number  of  literals  on  which  any  pair  of 
monomials  of  the  clustering  differ. 

Numerical/statistical  methods  typically  use  a  single  metric  that  measures  either  similarity 
or  dissimilarity  between  points.  The  objective  function  measures  the  overall  quality  of  a 
clustering relative to that metric. For the sake of generality, we have assumed separate
similarity  (tightness)  and  dissimilarity  (distance)  measures,  and  allow  the  objective  function 
(which  we  call  “goodness”)  to  be  a  function  of  these. 

Definition 2.7 $\mathcal{G} = \{G_n\}_{n \ge 1}$ is a uniformly computable family of goodness functions, with
$G_n : \mathrm{range}(T_n) \times \mathrm{range}(D_n) \to \Re^+$.

The goodness $G_n$ of a clustering is a real number representing how well both tightness
of clusters and distance between clusters have been achieved. We extend the domain of $G_n$
to $K_n$ with the natural interpretation that $(\forall C \in K_n)\; G_n(C) = G_n(T_n(C), D_n(C))$. We are
now  ready  to  define  clustering  problems. 

Definition  2.8 

• A (conceptual) clustering problem is any six-tuple $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$ with $\mathcal{X}$, $\mathcal{L}$, $\mathcal{I}$, $\mathcal{T}$, $\mathcal{D}$,
and $\mathcal{G}$ defined as above.

• An instance of a conceptual clustering problem $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$ is a seven-tuple $(X_n, L_n, I_n, T_n, D_n, G_n, S)$,
where $X_n \in \mathcal{X}$, $L_n \in \mathcal{L}$, $I_n \in \mathcal{I}$, $T_n \in \mathcal{T}$, $D_n \in \mathcal{D}$, $G_n \in \mathcal{G}$, and $S$ is any finite
nonempty subset of $X_n$.

• The solution to an instance $(X_n, L_n, I_n, T_n, D_n, G_n, S)$ of a conceptual clustering problem
is a clustering $C \in K_n$ (called a best clustering) such that

1. $C$ is a prime clustering of $S$.

2. For all clusterings $C'$ that satisfy 1. above, $G_n(C') \le G_n(C)$.

• An algorithm $A$ solves the conceptual clustering problem $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$ iff for all $n$
such that $X_n$ is nonempty, and for any finite nonempty set $S \subseteq X_n$, the algorithm $A$,
if given $n$ and $S$ as input, outputs a solution to the instance $(X_n, L_n, I_n, T_n, D_n, G_n, S)$.
In that case we also say that $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$ is solvable.

In  what  follows,  the  scope  of  the  variable  n  will  be  all  numbers  such  that  Xn  is  nonempty. 
Thus  the  statement  “for  all  n ”  is  used  to  mean  “for  all  n  such  that  Xn  is  nonempty”. 

Note that if $X_n$ is infinite, there may not exist a solution to $(X_n, L_n, I_n, T_n, D_n, G_n, S)$
because  there  may  be  an  infinite  sequence  of  clusterings  for  which  Gn  increases  without 
bound.  Also  note  that  the  solution  (if  it  exists)  of  an  instance  of  a  clustering  problem  may 
not  induce  a  partition  of  the  points  of  S ;  there  is  no  requirement  that  the  clusters  of  the 
solution  cover  disjoint  sets  (of  course,  disjointness  may  be  enforced  by  an  appropriate  choice 
of $G_n$). Further, the clusters may cover (possibly an infinite number of) points of $X_n \setminus S$.

These  definitions,  and  results  in  the  following  sections,  are  easily  applied  to  the  case  of 
cluster  analysis:  The  concept  description  language  is  simply  finite  subsets  of  Xn,  and  the 
interpretations $\mathcal{I}$ are identity functions $I_n$. Thus a cluster is simply a finite set of points of
$X_n$.

Since  we  are  interested  in  feasible  computations,  we  define  polynomial  time  solvability  of 
clustering  problems.  Recall  that  the  parameter  n  typically  reflects  a  natural  measure  of  size 
or length of encoding of objects $x \in X_n$.

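Definition 2.9 Let the families $\mathcal{T}$, $\mathcal{D}$, and $\mathcal{G}$ be uniformly polynomial time computable. Then
$(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$ is solvable in polynomial time iff there is an algorithm $A$ and polynomial
$p$ such that

1. $A$ solves the clustering problem $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$.

2. For any $n$ (for which $X_n$ is nonempty), and any finite $S \subseteq X_n$, the run-time of $A$ on
input $n$ and $S$ is at most $p(n, |S|)$.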
3  A  Restricted  Class  of  Clustering  Problems


The  definitions  in  Section  2  are  so  general  that  it  would  be  ridiculous  to  expect  that  all 
clustering  problems  are  polynomially  solvable  (or  even  solvable,  for  that  matter).  A  main  goal 
within  our  framework  is  to  identify  exactly  those  clustering  problems  that  are  (polynomially) 
solvable,  and  to  give  algorithms  for  solving  them.  Only  by  restricting  the  class  of  domains 
$\mathcal{X}$, the languages $\mathcal{L}$, and the objective functions $\mathcal{T}$, $\mathcal{D}$, and $\mathcal{G}$ under consideration, can we
begin to make progress toward this goal. (As it turns out, we will not need to make any
restrictions whatsoever on the domains $\mathcal{X}$.)

A  natural  algorithm  for  clustering  is  the  following:  Given  a  set  of  elements  S  to  be 
clustered,  begin  by  forming  a  cluster  for  each  element  of  S .  Then,  iteratively,  “merge”  the 
two  clusters  which  are  “closest”.  Halt  when  there  is  only  one  cluster  remaining  (containing 
all  of  the  points  of  S),  and  output  the  best  clustering  encountered  during  this  process.  This 
is  essentially  the  agglomerative  algorithm  formally  specified  in  Section  4. 
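The informal procedure above can be sketched generically. This is only an illustrative sketch under stated assumptions, not the formal specification of Section 4: the callables `singleton`, `merge`, `dist`, and `goodness` are hypothetical parameters supplied by the caller, ties between equally close pairs are broken by position, and no pruning of extraneous clusters is performed. The toy instance (closed intervals on the real line with a made-up score) is likewise an assumption chosen only to exercise the loop.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # point type
C = TypeVar("C")  # cluster (description) type


def agglomerate(
    points: Sequence[P],
    singleton: Callable[[P], C],
    merge: Callable[[C, C], C],
    dist: Callable[[C, C], float],
    goodness: Callable[[List[C]], float],
) -> List[C]:
    """Merge the closest pair until one cluster remains; return the best clustering seen."""
    current = [singleton(x) for x in points]
    best, best_score = list(current), goodness(current)
    while len(current) > 1:
        # Closest pair of clusters; the first minimal pair wins ties.
        i, j = min(combinations(range(len(current)), 2),
                   key=lambda ij: dist(current[ij[0]], current[ij[1]]))
        merged = merge(current[i], current[j])
        current = [c for k, c in enumerate(current) if k not in (i, j)]
        current.append(merged)
        score = goodness(current)
        if score > best_score:
            best, best_score = list(current), score
    return best


# Toy instance (assumption): clusters are closed intervals (lo, hi).
def gap(a, b):
    """Distance between two intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def score(cs):
    """Hypothetical goodness: smallest gap between clusters minus largest width."""
    width = max(hi - lo for lo, hi in cs)
    if len(cs) < 2:
        return -width
    return min(gap(a, b) for a, b in combinations(cs, 2)) - width


best = agglomerate([0, 1, 10, 11], lambda x: (x, x),
                   lambda a, b: (min(a[0], b[0]), max(a[1], b[1])), gap, score)
```

On this toy input the score first drops after the initial merge and then rises, so the procedure has to remember the best clustering seen rather than stop at the first decrease.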

In  Section  3.1  we  illustrate  the  agglomerative  algorithm  using  an  instance  of  the  monomial 
clustering problem. In Section 3.2 the example is used to motivate properties for $\mathcal{L}$, $\mathcal{I}$, $\mathcal{T}$, $\mathcal{D}$,
and $\mathcal{G}$ that guarantee that the agglomerative algorithm finds an optimal solution.

3.1  An  Example 

The  monomial  clustering  problem,  discussed  informally  in  the  last  section,  is  defined  by: 

• $\mathcal{X} = \{X_n\}_{n \ge 1}$, where $X_n$ is the set of boolean vectors over the variable set $x_1, x_2, \ldots, x_n$.

• $\mathcal{L} = \{L_n\}_{n \ge 1}$, where $L_n$ is the set of monomials over the same variables (cf. the single
representation trick (Cohen & Feigenbaum, 1983)).

• $\mathcal{I} = \{I_n\}_{n \ge 1}$, where $I_n$ is the standard logical interpretation, i.e., the interpretation of
a monomial is the set of boolean vectors that satisfy it.

• $\mathcal{T} = \{T_n\}_{n \ge 1}$, where, if $C = \{m_1, m_2, \ldots, m_k\}$ is a clustering of $L_n$,

$T_n(C) = \min_{1 \le i \le k} \{t_n(m_i)\}$,

and $t_n$, the tightness of a monomial, is the number of attributes (literals) in the monomial.

• $\mathcal{D} = \{D_n\}_{n \ge 1}$, where, if $C = \{m_1, m_2, \ldots, m_k\}$ is a clustering of $L_n$,

$D_n(C) = \min_{i \ne j} \{d_n(m_i, m_j)\}$,

and $d_n$, the distance between two monomials, is the number of attributes which appear
negated in one monomial and not negated in the other. If the clustering $C$ has only
one monomial, arbitrarily define $D_n(C) = 0$.

• $\mathcal{G} = \{G_n\}_{n \ge 1}$, where, if $C \in K_n$, then $G_n(C) = \min\{D_n(C), T_n(C)\}$.
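The three objective functions of the monomial clustering problem are simple enough to write down directly. The sketch below is an illustration under an assumed encoding (a monomial is a dict from 0-based attribute index to required boolean value); the function names mirror the symbols $t_n$, $d_n$, $T_n$, $D_n$, and $G_n$ but are otherwise hypothetical.

```python
from itertools import combinations
from typing import Dict, List

Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def t(m: Monomial) -> int:
    """Tightness of a monomial: the number of literals it contains."""
    return len(m)


def d(m1: Monomial, m2: Monomial) -> int:
    """Distance: number of attributes appearing with opposite signs in m1 and m2."""
    return sum(1 for i, v in m1.items() if i in m2 and m2[i] != v)


def T(clustering: List[Monomial]) -> int:
    """Tightness of a clustering: the minimum tightness of its monomials."""
    return min(t(m) for m in clustering)


def D(clustering: List[Monomial]) -> int:
    """Distance of a clustering: minimum pairwise distance, 0 for a single monomial."""
    return min((d(a, b) for a, b in combinations(clustering, 2)), default=0)


def G(clustering: List[Monomial]) -> int:
    """Goodness: the minimum of distance and tightness."""
    return min(D(clustering), T(clustering))


two_full = [{0: True, 1: True, 2: False}, {0: False, 1: False, 2: False}]
overlapping = [{0: True}, {1: True}]  # both are satisfied by the point (True, True)
```

For `two_full` the tightness is 3 and the distance is 2, so the goodness is 2; for `overlapping` the distance, and hence the goodness, is 0.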

The above objective functions $\mathcal{T}$, $\mathcal{D}$, and $\mathcal{G}$ capture the following three goals: (a) a tight
clustering should contain only monomials that cover few points (a difference of 1 in the value
of $t_n$ corresponds to a factor of 2 in the number of points of $X_n$ covered); (b) all monomials
found should be disjoint (a clustering $C$ containing non-disjoint monomials $m_1$ and $m_2$ will
have $D_n(C) = 0$, since $d_n(m_1, m_2) = 0$), and should differ on as many attributes as possible;
(c) a small value of $T_n$ or $D_n$ is equally undesirable, since the overall goal is to maximize the
minimum of the two measures.

To see how the agglomerative algorithm works on the monomial clustering problem, we
give a sample run using the instance $(X_9, L_9, I_9, T_9, D_9, G_9, S)$, where the input set $S$
consists of the points (events) $e_1, \ldots, e_5$ as follows:


[Listing of the events $e_1, \ldots, e_5$: each is a conjunction of nine literals over $x_1, \ldots, x_9$; the negation marks that distinguish the events are not legible in this copy.]

For the rest of this section, we will refer to $X_9$, $L_9$, $I_9$, $T_9$, $D_9$, $G_9$, $t_9$, and $d_9$ as
$X$, $L$, $I$, $T$, $D$, $G$, $t$, and $d$, respectively.

[Step  1] 

The  agglomerative  algorithm  begins  with  the  clustering 

$C_1 = \{m_i = e_i : 1 \le i \le 5\}$.
Note that this clustering is as “specific” as possible, in that each cluster is contained
in some cluster of every monomial clustering that also covers $S$. The goodness of this
clustering is 2, since all monomials have $t(m_i) = 9$, and the minimum distance (between the
pairs $(m_1, m_3)$, $(m_1, m_5)$, $(m_2, m_4)$, and $(m_3, m_5)$) is 2.

[Step  2] 

The  agglomerative  algorithm  chooses  one  of  the  minimally-distant  pairs  for  merging. 
Assume it picks the pair $(m_1, m_5)$. The obvious way to merge two monomials is to
simply drop the attributes on which they differ. This results in a new clustering
$C_2 = \{m_1, \ldots, m_4\}$ where:


$m_1$ = [the seven literals on which $e_1$ and $e_5$ agree; the literal signs are not legible in this copy]
$m_2 = e_2$
$m_3 = e_3$
$m_4 = e_4$
The new cluster $m_1$ covers the events $e_1$ and $e_5$, as well as other points of $X_n$. Its tightness
$t$ is 7, and the new minimum distance (between $m_1$ and $m_3$) is 1. Therefore, the goodness of
the  new  clustering  is  also  1.  Note  that  this  is  less  than  the  goodness  of  the  initial  clustering; 
the  algorithm  does  not  hill-climb  on  this  objective  function. 
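The merge operation and the drop in goodness can be reproduced on a small case. The sketch below is illustrative only: it uses three hypothetical four-attribute events rather than the nine-attribute events of the example, encodes a monomial as a dict from attribute index to required value (an assumption), and implements merging as "keep the literals on which both monomials agree", which for two monomials over the same attributes is the same as dropping the attributes on which they differ.

```python
from itertools import combinations
from typing import Dict, List

Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def merge(m1: Monomial, m2: Monomial) -> Monomial:
    """Keep exactly the literals that occur, with the same sign, in both monomials."""
    return {i: v for i, v in m1.items() if m2.get(i) == v}


def d(m1: Monomial, m2: Monomial) -> int:
    """Number of attributes appearing with opposite signs."""
    return sum(1 for i, v in m1.items() if i in m2 and m2[i] != v)


def G(clustering: List[Monomial]) -> int:
    """Goodness min{D, T} as defined for monomial clustering (D = 0 for one cluster)."""
    tight = min(len(m) for m in clustering)
    dist = min((d(a, b) for a, b in combinations(clustering, 2)), default=0)
    return min(dist, tight)


# Hypothetical events over four attributes, pairwise at distance 2.
ea = {0: True, 1: True, 2: True, 3: True}
eb = {0: True, 1: True, 2: False, 3: False}
ec = {0: False, 1: True, 2: True, 3: False}

before = [ea, eb, ec]
merged = merge(ea, eb)      # drops attributes 2 and 3
after = [merged, ec]
```

Here the goodness is 2 before the merge and 1 after it, mirroring the behaviour noted above: the merged cluster is both looser and closer to the remaining cluster.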

[Step  3] 

The algorithm merges the monomials $m_1$ and $m_3$. This results in a new clustering
$C_3 = \{m_1, m_2, m_3\}$ where:


$m_1$ = [the six literals on which $e_1$, $e_3$, and $e_5$ agree; the literal signs are not legible in this copy]
$m_2 = e_2$
$m_3 = e_4$

[Step  4] 

The minimum tightness is 6 ($m_1$) and the minimum distance is 2 (between $m_2$ and
$m_3$). The algorithm therefore merges $m_2$ and $m_3$, resulting in $C_4 = \{m_1, m_2\}$ where:

$m_1$ = [unchanged from $C_3$: six literals]
$m_2$ = [seven literals over $x_1, x_2, x_3, x_4, x_5, x_8, x_9$; the literal signs are not legible in this copy]


This  clustering  has  T  =  6  and  D  =  6,  so  G  =  6.  This  is,  in  fact,  a  best  clustering  of  the 
events  under  the  given  objective  function. 

[Step  5] 

The  last  step  in  the  example  merges  the  two  remaining  clusters  into  a  single  monomial 
to obtain the clustering $C_5 = \{\emptyset\}$ (a single monomial with no literals), with $G(C_5) = T(C_5) = D(C_5) = 0$. The
algorithm therefore keeps $C_4$ as the best clustering.

3.2  Properties 

In  this  section,  we  present  some  properties  for  clustering  problems,  using  the  example  from 
the previous section to provide intuitive motivations. For a clustering problem $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$,
we state only the (more restrictive) properties sufficient for polynomial time solvability; corresponding
properties for general solvability are trivially obtained by dropping the polynomial
time  requirements.  In  Section  4,  we  will  prove  that  a  clustering  problem  possessing  the 
properties  is  solvable  in  polynomial  time  because  the  agglomerative  algorithm  meets  the 
requirements  of  Definition  2.9. 

In our example, the cluster description language and interpretation make it easy to
determine whether a point $x \in X_n$ is covered by a statement $c \in L_n$. The ability to determine
cluster membership is necessary if the agglomerative algorithm is to create prime clusterings:
the merging operation used in the agglomerative algorithm may produce a new cluster
whose  interpretation  is  a  superset  of  (the  interpretation  of)  some  cluster  not  involved  in  the 
merge.  To  guarantee  prime  clusterings,  we  must  be  able  to  detect  such  extraneous  clusters. 
The  first  property  therefore  requires  that  we  be  able  to  determine  cluster  membership  in 
polynomial  time. 

Property P1: There exists a polynomial time algorithm that, when given as input any
number $n$, any point $x \in X_n$, and any cluster $c \in L_n$, outputs “true” if $x \in I_n(c)$, and
“false” otherwise.

The membership algorithm may be used to obtain a prime clustering from a given clustering
of a set $S$. In particular, let the polynomial time subroutine PRIME(n, C, S) return a
prime clustering $C'$ of $S$, where $C'$ contains a subset of the clusters of $C$. PRIME iteratively
examines each cluster $c \in C$ and adds it to $C'$ iff there is some point $x \in S$ such that
$x \in I_n(c)$ and $C'$ does not cover $\{x\}$.
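The single-pass rule just described can be sketched as follows for monomials. This is a hedged illustration, not the paper's code: the dict encoding of monomials and the names `satisfies` and `prime` are assumptions, the parameter $n$ is implicit in the length of the points, and the sketch implements only the stated rule (keep a cluster iff it covers some point of $S$ not yet covered by the clusters kept so far), without any further pruning. Its output depends on the order in which the clusters are examined.

```python
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[bool, ...]
Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def satisfies(x: Point, m: Monomial) -> bool:
    """Membership test of Property P1 for monomials."""
    return all(x[i] == v for i, v in m.items())


def prime(clustering: Sequence[Monomial], points: Sequence[Point]) -> List[Monomial]:
    """One pass over the clusters, keeping those that cover a not-yet-covered point."""
    kept: List[Monomial] = []
    covered: Set[int] = set()  # indices of points covered by the kept clusters
    for c in clustering:
        newly = {k for k, x in enumerate(points)
                 if k not in covered and satisfies(x, c)}
        if newly:
            kept.append(c)
            covered |= newly
    return kept


S = [(True, True), (False, True)]
C = [{1: True}, {0: True}, {0: False}]  # the first monomial already covers S
result = prime(C, S)
```

With this examination order the two later monomials add no new points and are discarded, leaving the single cluster `{1: True}`. Each cluster is tested against each point once, so the pass uses $|C| \cdot |S|$ membership tests.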

The  next  property  is  based  on  step  1  of  the  example.  In  this  step,  the  algorithm  created 
a single cluster for each point $x_i \in S$. Each cluster $c_i$ covered $x_i$ and as few additional points
of $X$ as possible. For some languages $L_n$ and some sets $S$, there may not exist a clustering of
individual  points  which  is  “most  specific”  in  this  sense.  Property  P2  asserts  that  the  cluster 
description  languages  must  be  such  that  for  any  set  of  points  in  Xn,  there  must  exist,  and 
there  must  exist  feasible  (polynomial)  means  for  finding,  a  cluster  (statement  in  Ln )  that  is 
the  most  specific  of  any  statement  in  Ln  covering  those  points. 

Definition 3.1 For a given clustering problem $(\mathcal{X}, \mathcal{L}, \mathcal{I}, \mathcal{T}, \mathcal{D}, \mathcal{G})$, and any $n$, the maximally
specific cover (MSC) for a set of points $P \subseteq X_n$ is a cluster $c \in L_n$ such that $c$ covers $P$,
and for any $c' \in L_n$, if $c'$ covers $P$, then $I_n(c) \subseteq I_n(c')$.

Property P2: There exists an algorithm that, when given as input any number $n$ and
finite $S = \{x_1, x_2, \ldots, x_s\} \subseteq X_n$, outputs a clustering $C = \{c_1, c_2, \ldots, c_s\} \in K_n$ such
that for $1 \le i \le s$, $c_i$ is an MSC for $\{x_i\}$. The run-time of the algorithm must be
polynomial in $n$ and $|S|$.

Property P2 implies that the descriptions $\{c_i\}$ of the maximally specific covers of the
singleton point sets must be at most of size polynomial in the total size of $S$ (otherwise, the
algorithm may spend exponential time just creating them).
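For monomial clustering, Property P2 is easy to satisfy: the MSC of a single boolean vector is the monomial that fixes every attribute to that vector's value. A minimal sketch follows, assuming the same dict encoding of monomials used for illustration elsewhere; the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Point = Tuple[bool, ...]
Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def msc_of_point(x: Point) -> Monomial:
    """The full conjunction of n literals satisfied by x and by no other vector."""
    return dict(enumerate(x))


def initial_clustering(points: Sequence[Point]) -> List[Monomial]:
    """One maximally specific cluster per input point, in time linear in n * |S|."""
    return [msc_of_point(x) for x in points]


def satisfies(x: Point, m: Monomial) -> bool:
    return all(x[i] == v for i, v in m.items())


c = msc_of_point((True, False, True))
# Every boolean vector of length 3 that satisfies c.
models = [x for x in product((True, False), repeat=3) if satisfies(x, c)]
```

Each description has exactly $n$ literals, so the size bound discussed above holds trivially in this case.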

The next two properties are based on the merging operation in the example. The agglomerative
algorithm assumes that $D(C)$ for a clustering is based on an inter-cluster distance
measure $d$:

Property P3: There is a family of uniformly polynomial time computable functions $d =
\{d_n\}$, where $d_n : L_n \times L_n \to \Re^+$, such that for all $n$, and $C \in K_n$,

(a) $D_n(C) = \min\{d_n(c_i, c_j) : c_i, c_j \in C,\ c_i \ne c_j\}$.

When restricted to single-cluster clusterings, $\{D_n\}$ may be any family of uniformly
polynomial time computable functions satisfying (b) below.

(b) $(\forall c_1, c_2, c_3, c_4 \in L_n)$ $(I_n(c_1) \subseteq I_n(c_2))$ and $(I_n(c_3) \subseteq I_n(c_4))$ $\Rightarrow$ $d_n(c_1, c_3) \ge d_n(c_2, c_4)$.

If $C_1 = \{c_1\}$ and $C_2 = \{c_2\}$ are clusterings containing only single clusters, then

$I_n(c_1) \subseteq I_n(c_2) \Rightarrow D_n(C_1) \ge D_n(C_2)$.

Property P3 requires that the distance D of a clustering really is the minimum inter-cluster
“distance” d between any pair of clusters, where d has a monotone property: If two
clusters  have  distance  d ,  and  points  are  then  added  to  each,  the  distance  d  cannot  increase. 
In  other  words,  as  clusters  “grow” ,  the  distance  between  them  shrinks,  and  D  is  the  minimum 
of  all  these  inter-cluster  distances. 
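The monotone property in part (b) can be checked exhaustively for the monomial distance on a tiny instance. The sketch below is an illustration, not a proof: it assumes the dict encoding of monomials, uses the fact that for monomials $I_n(c_1) \subseteq I_n(c_2)$ holds exactly when every literal of $c_2$ also occurs in $c_1$, and counts violations of (b) over all quadruples of monomials on two attributes. All names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator

Monomial = Dict[int, bool]  # attribute index -> required value (assumed encoding)


def all_monomials(n: int) -> Iterator[Monomial]:
    """Every monomial over n attributes: each attribute is absent, positive, or negated."""
    for choice in product((None, True, False), repeat=n):
        yield {i: v for i, v in enumerate(choice) if v is not None}


def d(m1: Monomial, m2: Monomial) -> int:
    """Number of attributes appearing with opposite signs."""
    return sum(1 for i, v in m1.items() if i in m2 and m2[i] != v)


def contained(c1: Monomial, c2: Monomial) -> bool:
    """I(c1) is a subset of I(c2) iff every literal of c2 also occurs in c1."""
    return all(c1.get(i) == v for i, v in c2.items())


def monotone_violations(n: int) -> int:
    """Count quadruples where generalizing both clusters increases the distance."""
    ms = list(all_monomials(n))
    count = 0
    for c1, c2, c3, c4 in product(ms, repeat=4):
        if contained(c1, c2) and contained(c3, c4) and d(c1, c3) < d(c2, c4):
            count += 1
    return count


violations = monotone_violations(2)
```

The count is zero, as expected: dropping literals from either monomial can only remove conflicting attributes, never create them.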

It  is  not  clear  what  is  meant  by  the  “distance”  of  a  clustering  C  when  C  contains  only 
a  single  cluster.  Generally,  one  is  only  interested  in  clusterings  that  contain  more  than  one 
cluster.  A  possible  way  to  deal  with  this  is  to  simply  let  the  value  of  G  be  zero  for  any  such 
clustering,  or  to  let  the  value  of  D  be  zero.  For  the  sake  of  generality,  we  have  chosen  the 
weakest  requirement,  which  is  to  allow  D  to  be  defined  arbitrarily  for  one-cluster  clusterings, 
but  to  require  that  D  be  monotone  under  generalization.  Perhaps  more  natural,  but  also 
more  restrictive,  would  be  to  require  that  the  value  of  D  be  the  same  for  all  one-cluster 
clusterings,  or  to  let  G  be  defined  as  a  function  of  T(C)  alone  when  C  has  only  one  cluster. 

In  steps  2  -  4  of  the  example,  the  algorithm  merged  monomials  in  an  obvious  way 
to  produce  new  clusters.  Also,  the  clusters  produced  by  merging  monomials  were  always 
maximally  specific  covers  (for  the  points  covered  by  the  merged  clusters).  Property  P4  is 
related  to  P2.  It  requires  that  we  be  able  to  generate,  in  polynomial  time,  an  MSC  for  the 
union  of  any  two  sets  of  points  in  the  problem  space  described  by  clusters: 


Property P4: There is an effective procedure $M$ such that for any $n$, $M$ “merges” any two
clusters $c, c' \in L_n$. For all $n$, $M$ and $L_n$ must have the following properties:

(a) For all $c, c' \in L_n$, there is an MSC $c''$ for $I_n(c) \cup I_n(c')$, and $M(n, c, c') = c''$.

(b) $M$ runs in polynomial time, i.e., in time polynomial in $n$, and in the lengths of
the statements $c$ and $c'$.

(c) There is a polynomial $q$ such that for any finite subset $S$ of $X_n$, if $c$ is obtained by
any (finite) number of merges of MSCs of subsets of $S$, then $c$ has size at most
$q(n, |S|)$.
Part (c) assures that the following event does not occur: During a run of the agglomerative
algorithm, clusters are “merged” successively. It is possible that at some point, the
description of a cluster is larger than some polynomial in the size of $S$. Any reasonable
restriction on the size of the statement $M(n, c, c')$ in part (b) will not prevent this event
from occurring. For example, since there are at most $|S|$ iterations, even if we require that
the length of the statement $M(n, c, c')$ is at most the sum of the lengths of the statements $c$
and $c'$, it is possible that the final clustering obtained by the algorithm will have size exponential
in $|S|$. By requiring part (c) in addition to part (b), we guarantee that any statement
produced by the algorithm has size at most polynomial in the size of $S$.

In  many  cases  (e.g.,  when  clusters  are  conjunctive  descriptions  over  any  collection  of 
attributes),  the  size  of  descriptions  will  decrease  as  clusters  become  more  general.  In  other 
cases  (e.g.,  axis-aligned  rectangles  in  Euclidean  spaces),  description  size  will  remain  constant. 
In  cluster  analysis,  property  P4  is  trivially  satisfied,  since  merging  is  done  by  union  (of  sets 
of  points),  and  the  largest  statement  is  exactly  S. 
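The two cases just mentioned can be sketched as merge operators in the sense of property P4. The encodings below (a monomial as a dictionary from attribute index to required truth value, a rectangle as a tuple of per-axis intervals) and the function names are assumptions made for this illustration only; they are not taken from the text.

```python
from typing import Dict, Tuple

# Hypothetical encodings, chosen only for this sketch.
Monomial = Dict[int, bool]               # attribute index -> required truth value
Rect = Tuple[Tuple[float, float], ...]   # one (low, high) interval per axis


def merge_monomials(c: Monomial, c2: Monomial) -> Monomial:
    """Keep only the literals on which both monomials agree.

    The result is the most specific monomial covering every point covered
    by either input, and it never has more literals than either input.
    """
    return {attr: val for attr, val in c.items() if c2.get(attr) == val}


def merge_rects(r: Rect, r2: Rect) -> Rect:
    """Smallest axis-aligned box containing both boxes (same description size)."""
    return tuple((min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(r, r2))
```

In both sketches the description size does not grow under merging, so a polynomial bound of the kind required by part (c) holds for these representations.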

Property  P5  is  based  on  the  observation  that,  in  the  monomial  example,  clusters  became 
less  cohesive  (more  general  or  less  tight )  as  merges  occurred.  We  need  the  following  definition: 

Definition 3.2 Given Xn, Ln, and In, the relation <n on Kn is defined by: For all C, C' ∈
Kn, C <n C' iff (∀c ∈ C)(∃c' ∈ C') In(c) ⊆ In(c'). If C <n C' we say C is less general than,
more specific than, and is a specialization of, C', and equivalently, that C' is more general
than, less specific than, and is a generalization of, C. When restricted to prime clusterings
with respect to a given set S, <n is a partial order on Kn.
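As a small sketch of Definition 3.2, assume each cluster is given directly by the set of points it covers (that is, by its interpretation). The function name is hypothetical.

```python
from typing import Hashable, Iterable, Set


def is_specialization(C: Iterable[Set[Hashable]], C_prime: Iterable[Set[Hashable]]) -> bool:
    """True iff every cluster of C is contained in some cluster of C_prime."""
    C_prime = [set(cp) for cp in C_prime]
    return all(any(set(c) <= cp for cp in C_prime) for c in C)
```

For example, `[{1}, {2}, {3}]` is a specialization of `[{1, 2}, {3}]` but not conversely. Without some restriction the relation need not be antisymmetric: `[{1}, {1, 2}]` and `[{1, 2}]` are each a specialization of the other.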

Property P5: The tightness functions T = {Tn} are a uniformly polynomial time
computable family of functions, and for all n, the function Tn is monotone nonincreasing
under generalization, i.e., for all C, C' ∈ Kn, if C <n C' then Tn(C) ≥ Tn(C').

Property  P5  asserts  that  if  one  clustering  is  a  generalization  of  another,  then  the  more 
general  clustering  is  at  most  as  tight  as  the  less  general.  Because  we  have  defined  tightness  as 
a  function  of  clusterings,  and  not  of  individual  clusters,  it  is  not  immediately  clear  that  this 
is  a  natural  property.  Observe,  however,  that  the  definition  of  the  relation  <n  states  that 
each  cluster  of  the  less  general  clustering  is  contained  in  some  cluster  of  the  more  general 
clustering. Thus if Tn somehow depends on the “tightness” of particular clusters (e.g., if
tightness of individual clusters is inversely related to the quantity or variety of elements
covered), then the more general clustering contains clusters at most as tight as the clusters
of the less general clustering. We would then expect that the overall value of Tn would be
greater for the less general clustering.

Finally,  property  P6  simply  says  that  goodness  has  a  very  natural  property:  If  you 
increase  either  distance  or  tightness,  while  holding  the  other  constant,  then  goodness  should 
not  decrease.  In  other  words,  tight,  distant  clusterings  are  best. 

Property P6: The goodness functions G = {Gn} are a uniformly polynomial time
computable family of functions, and for all n, the function Gn is monotone nondecreasing
in Tn and Dn. That is, if x1 ≥ x2 ∈ range(Tn) and y1 ≥ y2 ∈ range(Dn), then
Gn(x1, y1) ≥ Gn(x2, y1) and Gn(x1, y1) ≥ Gn(x1, y2).

3.3  Example  Clustering  Problems 

The properties P1 through P6 hold for several interesting conceptual clustering and cluster
analysis  problems  that  fall  within  our  framework.  In  this  section,  we  present  some  of  these, 
without  proof  that  they  do  indeed  satisfy  the  properties. 

Monomials The properties P1 - P6 were motivated by, and are natural generalizations of,
properties held by the monomial clustering problem. It is easily verified that these
properties are satisfied by the monomial clustering problem as defined in Section 3.1.
The properties also hold for conjunctive conceptual clustering using multiple-valued
attributes and internal disjunction (Michalski & Stepp, 1983).¹ In this case, the initial
clustering is as for monomials and the “refunion” operator (Michalski, 1983) can be
used for merging. Natural extensions of the distance and tightness measures in the
example satisfy P3 and P5, and these may be used with any objective function satisfying
P6.

¹ Some of the metrics used by Michalski & Stepp, e.g., “simplicity” and “sparseness”, clearly do not satisfy
properties P2 and P5.

Geometric Another interesting group of languages which have these properties are some
geometric languages over Euclidean spaces. For example, the agglomerative algorithm
can solve the axis-aligned rectangle clustering problem, defined by:

• X = {Xn}, where Xn is n-dimensional Euclidean space.

• L = {Ln}, where Ln is the set of n-dimensional rectangles with sides parallel to
the n axes.

• I = {In} is the standard interpretation: In(r), where r is a rectangle of Ln, is the
set of points of Xn that are contained in r.

•  T  =  {Tn},  where  Tn(C)  for  a  collection  of  rectangles  C  could  be  any  of 

1.  The  inverse  of  the  area  of  the  union  of  the  rectangles  of  C. 

2.  The  inverse  of  the  area  of  the  largest  rectangle  of  C . 

3.  The  inverse  of  the  maximum  distance  between  any  two  points  within  any 
cluster,  i.e. ,  the  inverse  of  the  length  of  the  longest  diagonal. 

• D = {Dn}, where Dn(C) is the minimum pairwise “distance” dn between any pair
of rectangles of C, and dn is any metric that does not increase as clusters grow. (Some
metrics dn that do not have this property include dn(r1, r2) = maximum distance
between any pair of points of r1 and r2, or distance between the centers of r1 and
r2.) Let Dn(C) = 0 if C has only one cluster.

• G = {Gn} is any objective function satisfying P6.
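The ingredients listed above can be sketched concretely as follows. The rectangle encoding (a tuple of per-axis `(low, high)` intervals), the function names, and the choice of the closest-point gap as dn are assumptions made for this sketch; the gap is one metric that cannot increase when either rectangle grows. Tightness options 2 and 3 are shown; degenerate rectangles are given infinite tightness here, which is also an assumption.

```python
import math
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # hypothetical encoding: (low, high) per axis


def area(r: Rect) -> float:
    return math.prod(hi - lo for lo, hi in r)


def diagonal(r: Rect) -> float:
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in r))


def gap(r1: Rect, r2: Rect) -> float:
    """Distance between the closest points of two boxes (0 if they overlap)."""
    return math.sqrt(sum(max(lo1 - hi2, lo2 - hi1, 0.0) ** 2
                         for (lo1, hi1), (lo2, hi2) in zip(r1, r2)))


def _inverse(x: float) -> float:
    return math.inf if x == 0 else 1.0 / x


def tightness_largest_area(C: List[Rect]) -> float:
    """Option 2: inverse of the area of the largest rectangle of C."""
    return _inverse(max(area(r) for r in C))


def tightness_longest_diagonal(C: List[Rect]) -> float:
    """Option 3: inverse of the length of the longest diagonal."""
    return _inverse(max(diagonal(r) for r in C))


def distance(C: List[Rect]) -> float:
    """D_n: minimum pairwise gap, and 0 when C has only one cluster."""
    if len(C) < 2:
        return 0.0
    return min(gap(a, b) for a, b in combinations(C, 2))
```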

The  critical  condition  for  geometric  languages  is  that  there  exist,  and  we  can  find,  a 
description  of  the  smallest  set  representable  in  the  language  that  covers  a  given  set  of 
points.  For  example,  if  the  language  consists  of  descriptions  of  all  convex  polygons  in 
2-dimensional  Euclidean  space,  then  it  is  easy  to  see  that  the  agglomerative  algorithm 
may  be  successfully  applied,  since  the  convex  hull  of  a  set  of  points  (which  may  be 
found  in  polynomial  time)  is  contained  in  every  convex  polygon  containing  the  points. 
However, if L = {Ln}, where Ln consists of descriptions of convex polytopes in n
dimensions (i.e., a list of (n − 1)-dimensional hyperplanes), then properties P2 and
P4 do not hold, because the length of a description of the convex hull of a set of
s points in n dimensions (i.e., the length of the description of the MSC of a set of
points) can be exponential in n (Edelsbrunner, 1987). If we are willing to relax the
requirement that the clustering found have size polynomial in the dimension, then
the agglomerative algorithm can be used to find an optimal clustering. The MSCs
are obtained by applying any algorithm for finding the convex hull of a set of points
in n dimensions. Alternatively, if a different representation of the convex hull is used
(e.g., the extremal points), then properties P1 - P6 hold, and the agglomerative
algorithm may be used to find optimal clusters in time polynomial in the dimension
and the initial number of points.

Cluster Analysis Within our framework, any cluster analysis problem trivially satisfies
properties P1, P2, and P4. Whether properties P3, P5, and P6 are satisfied will
depend on the particular choice of the objective functions. Single linkage clustering
(Anderberg, 1973), in which the distance between clusters (dn in our framework) is
the minimum distance between points in the clusters, clearly satisfies the monotone
requirement for distance metrics in property P3. Complete linkage clustering (dn is
the maximum inter-point distance) clearly does not. Most reasonable choices of G will
satisfy property P6.
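The contrast between the two linkage rules can be checked on a tiny example. The sketch below assumes points on the real line and reads the monotone requirement as "dn must not increase when a cluster is enlarged", which is how it is used in the proofs that follow; the function names are hypothetical.

```python
from typing import Set


def single_link(c: Set[float], c2: Set[float]) -> float:
    """Minimum distance between a point of c and a point of c2."""
    return min(abs(x - y) for x in c for y in c2)


def complete_link(c: Set[float], c2: Set[float]) -> float:
    """Maximum distance between a point of c and a point of c2."""
    return max(abs(x - y) for x in c for y in c2)


def linkage_under_growth():
    """Distances to {10} before and after enlarging {0} to {-5, 0}."""
    c, other = {0.0}, {10.0}
    grown = c | {-5.0}
    return {
        "single": (single_link(c, other), single_link(grown, other)),
        "complete": (complete_link(c, other), complete_link(grown, other)),
    }
```

Enlarging the cluster leaves the single-linkage distance at 10 but raises the complete-linkage distance from 10 to 15, so complete linkage violates the requirement on this instance.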


4  The  Agglomerative  Algorithm 
4.1  Algorithm  A 

The algorithm we consider is the Central Agglomerative Procedure as described by Anderberg
(1973). Variants of this method constitute the majority of the work on hierarchical
clustering (Romesburg, 1984). Hierarchical clustering techniques are normally used to produce
a classification tree over the object set, where leaves are individual objects, and internal
nodes represent clusters. We will instead be concerned with whether the technique finds
a single clustering which is best under the objective function. In this sense, we are using
the agglomerative/hierarchical procedure as an optimization technique (Everitt, 1980), but
without fixing the number of clusters beforehand. We are also allowing the algorithm to
produce non-disjoint clusters (cf. clumping techniques (Everitt, 1980)).

Given n and a finite nonempty set S ⊆ Xn, the algorithm produces t ≤ |S| different
clusterings C1, C2, ..., Ct, by starting with the maximally specific cluster for each point in
S and successively merging clusters with minimum distance until a single cluster covering
all of S is obtained. After each merge, extraneous clusters are eliminated. The output of
the algorithm is the clustering among C1, C2, ..., Ct with the best value. We will prove that
the algorithm solves any clustering problem (X, L, I, T, D, G) that satisfies properties P1
through P6. (The algorithm itself implicitly assumes that properties P1, P2, P3 part (a), and
P4 hold.)


Agglomerative Algorithm A

INPUT(n, S)
FOR each xi ∈ S, compute ci, an MSC for {xi}
C1^temp ← {ci : xi ∈ S}
C1 ← PRIME(n, C1^temp, S)
IF |C1| = 1 THEN done ← TRUE, ELSE done ← FALSE
i ← 1
WHILE done ≠ TRUE DO BEGIN
    i ← i + 1
    compute dn(cj, ck) for each cj, ck ∈ Ci-1
    let c, c' ∈ Ci-1 be such that Dn(Ci-1) = dn(c, c')
        (c and c' are the two closest clusters of Ci-1.)
    Ci^temp ← Ci-1 − {c, c'} ∪ M(n, c, c')
        (Ci^temp is Ci-1 with clusters c and c' merged.)
    Ci ← PRIME(n, Ci^temp, S)
        (eliminate any extraneous clusters.)
    IF |Ci| = 1 THEN done ← TRUE
END
t ← i    (index of final clustering formed)
OUTPUT any C ∈ {C1, ..., Ct} such that Gn(C) is maximum.
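The procedure can be sketched in Python for one simple cluster analysis instance. Everything specific below is an assumption made for illustration: points are numbers on a line, a cluster is the set of points it contains, merging is set union, dn is single linkage, Tn is the inverse of one plus the largest cluster width, Dn is 0 for a one-cluster clustering, Gn is the product of tightness and distance, and `prime` is one plausible reading of PRIME (drop a cluster whose points of S are all covered by the remaining clusters).

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple

Cluster = FrozenSet[float]
Clustering = List[Cluster]


def single_link(c: Cluster, c2: Cluster) -> float:
    """d_n: minimum distance between points of two clusters."""
    return min(abs(x - y) for x in c for y in c2)


def distance(C: Clustering) -> float:
    """D_n: minimum pairwise d_n; 0 for a one-cluster clustering (a choice made here)."""
    if len(C) < 2:
        return 0.0
    return min(single_link(a, b) for a, b in combinations(C, 2))


def tightness(C: Clustering) -> float:
    """T_n: shrinks as the widest cluster grows (an illustrative choice)."""
    return 1.0 / (1.0 + max(max(c) - min(c) for c in C))


def goodness(t: float, d: float) -> float:
    """G_n: nondecreasing in both arguments for nonnegative inputs."""
    return t * d


def prime(C: Clustering, S: Set[float]) -> Clustering:
    """Assumed reading of PRIME: remove duplicates and extraneous clusters."""
    kept = list(dict.fromkeys(C))
    for c in list(kept):
        rest = set().union(*(o for o in kept if o != c))
        if len(kept) > 1 and (c & S) <= rest:
            kept.remove(c)
    return kept


def algorithm_a(points: Iterable[float]) -> Tuple[Clustering, List[Clustering]]:
    """Return a best clustering among C1..Ct together with the whole sequence."""
    S = set(points)
    current = prime([frozenset([x]) for x in sorted(S)], S)
    history = [current]
    while len(current) > 1:
        # The two closest clusters realize D_n of the current clustering.
        c, c2 = min(combinations(current, 2), key=lambda pair: single_link(*pair))
        merged = [k for k in current if k not in (c, c2)] + [c | c2]
        current = prime(merged, S)
        history.append(current)
    best = max(history, key=lambda C: goodness(tightness(C), distance(C)))
    return best, history
```

On the points 0, 1, 10, 11 the four clusterings produced have goodness 1, 0.5, 4.5, and 0 under these choices, so the sketch returns the two-cluster clustering {0, 1}, {10, 11}. The dip from 1 to 0.5 before the maximum shows why the loop cannot stop at the first decrease.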

Theorem 4.1 If (X, L, I, T, D, G) is any clustering problem such that properties P1 through
P6 are satisfied, then A solves (X, L, I, T, D, G) in polynomial time.

Proof: By property P2, the clustering C1^temp is found in polynomial time. Note that
d = {dn}, D = {Dn}, T = {Tn}, and G = {Gn} are uniformly polynomial time computable
families (properties P3, P5, and P6), and that the merge operation M never produces a
statement of length greater than q(n, |S|) (property P4 parts (b) and (c)). Subroutine PRIME
runs in polynomial time by the comments following the introduction of property P1. Since
there are t ≤ |S| iterations (because |Ci+1| < |Ci|), the algorithm runs in time polynomial
in |S| and n. We need only show that the algorithm is correct.

Lemma 4.2 Let (Xn, Ln, In, Tn, Dn, Gn, S) be an instance of a clustering problem that
satisfies properties P1 through P6. Let C ∈ Kn be a prime clustering of S, a specialization of
some best clustering, and suppose that C itself is not a best clustering. Let c, c' ∈ C be such
that Dn(C) = dn(c, c'), and let C' = C − {c, c'} ∪ M(n, c, c'). (In other words, C' is obtained
from C by merging two clusters with minimum distance dn.) Then C' is a specialization of
a best clustering, as is PRIME(n, C', S).

We first show that Theorem 4.1 follows from Lemma 4.2, and then prove Lemma 4.2. To
prove the theorem, we need only show that at least one of the clusterings {C1, C2, ..., Ct}
is a best clustering.

Suppose by way of contradiction that none of the Ci's is a best clustering. By the
definition of the maximally specific cover (MSC) of a set of points P, any cluster which
covers P must cover a superset of the points covered by the MSC. C1^temp consists of the
MSCs for each point in S. Thus, C1^temp is a specialization of every clustering of S and
therefore of a best clustering. Trivially, C1 is a specialization of a best clustering, and is a
prime clustering of S. By Lemma 4.2 (and the definition of C2^temp), C2^temp is a specialization
of a best clustering, as is C2. Iteratively applying Lemma 4.2 and our supposition that
none of the Ci's are best, we have that each of C1, C2, ..., Ct is a specialization of a best
clustering. (Each is also a prime clustering of S.) But this is a contradiction, for Ct cannot
be a specialization of a best clustering without being a best clustering: Since Ct has only one
cluster, and only prime clusterings are candidate solutions, any generalization Cbest of Ct must
have exactly one cluster that contains the single cluster of Ct. Further, Tn(Cbest) ≤ Tn(Ct)
and Dn(Cbest) ≤ Dn(Ct) by properties P5 and P3 part (b). By P6, Gn(Cbest) ≤ Gn(Ct), and
thus Ct is in fact a best clustering. It follows that our supposition was wrong, and at least
one of {C1, C2, ..., Ct} must be a best clustering. □

We now prove Lemma 4.2. Let Cbest be a best clustering, with C <n Cbest, and C a prime
clustering of S, but not a best clustering. Then Gn(C) < Gn(Cbest).

Let c, c', and C' be as defined in the lemma. (Thus dn(c, c') = Dn(C).) Since C <n Cbest,
there are clusters b, b' ∈ Cbest such that In(c) ⊆ In(b) and In(c') ⊆ In(b'). There are now two
cases:

Case 1: b = b'

Then the only cluster of C' that is not also a cluster of C is M(n, c, c'). Observe
that In(c) ∪ In(c') ⊆ In(b), and by the definition of M(n, c, c') as maximally specific,
In(M(n, c, c')) ⊆ In(b). Thus C' is a specialization of Cbest. Trivially, PRIME(n, C', S)
is also a specialization of Cbest, and the lemma is proved.

Case 2: b ≠ b'

In this case, Tn(C) ≥ Tn(Cbest) by property P5, and now note that:

Dn(C) = dn(c, c')     (by choice of c, c')
      ≥ dn(b, b')     (by property P3 part (b))
      ≥ Dn(Cbest)     (by property P3 part (a)).

Since Tn(C) ≥ Tn(Cbest) and Dn(C) ≥ Dn(Cbest), by property P6, Gn(C) ≥ Gn(Cbest),
contradicting the hypothesis of the lemma that C is not a best clustering. Thus case 1
must hold, completing the proof of Lemma 4.2 and Theorem 4.1. □

4.2  Properties  of  Algorithm  A 

It is interesting to note that algorithm A is not a hill-climbing algorithm, in that the value of
the objective function may increase and decrease as the sequence of clusterings C1, C2, ..., Ct
is formed. (Recall the monomial example in Section 3.1.) It is true, however, that the
measure of tightness Tn is monotone nonincreasing as each new clustering is examined. The
function Dn is not necessarily monotone, because it is possible for Dn to increase when
two clusters are merged (since the minimum distance is eliminated), and also to decrease
(since the new larger cluster may be very close to some other cluster). It is for this reason
that the algorithm must continue generating clusterings rather than stopping once the value
of Gn decreases. Under some objective functions, the algorithm may hill-climb.
Single-linkage cluster analysis problems (Anderberg, 1973), for example, define dn as the minimum
“distance” between points of two clusters (where “distance” is any metric). If such a problem
satisfies properties P1 - P6, then the agglomerative algorithm will hill-climb on the objective
function.

It is also worth noting that algorithm A finds, for each k ≤ s (where s = |S|), the best
clustering with at least k clusters. We will show that for 1 ≤ k ≤ s, the best clustering with
at least k clusters is in the set {C1, C2, ..., Ct}. This is achieved by proving, for each fixed
k ≤ s, the following variant of Lemma 4.2.

Let bestk mean “best among all prime clusterings of S with at least k clusters”.

Lemma 4.3 Let (Xn, Ln, In, Tn, Dn, Gn, S) be an instance of a clustering problem that
satisfies properties P1 through P6. Let C ∈ Kn be a prime clustering of S containing at least
k clusters. Further, let C be a specialization of some bestk clustering, and suppose that
C itself is not a bestk clustering. Let c, c' ∈ C be such that Dn(C) = dn(c, c'), and let
C' = C − {c, c'} ∪ M(n, c, c'). (In other words, C' is obtained from C by merging two
clusters with minimum distance dn.) Then C' is a specialization of a bestk clustering, as is
PRIME(n, C', S).

Lemma  4.3  differs  from  Lemma  4.2  only  in  that  “best  clustering”  has  been  replaced  with 
“bestk  clustering”  and  the  additional  hypothesis  that  C  has  at  least  k  clusters  has  been 
added.  The  proof  of  Lemma  4.3  is  nearly  identical  to  the  proof  of  Lemma  4.2.  (One  needs 
the  fact  that  C  has  at  least  k  clusters  to  arrive  at  the  contradiction  in  Case  2.) 

We can now prove

Theorem 4.4 If (X, L, I, T, D, G) is any clustering problem such that properties P1 through
P6 are satisfied, then for each k ≤ |S|, the set of clusterings {C1, C2, ..., Ct} produced by
algorithm A contains a bestk clustering (if one exists).

To prove Theorem 4.4, we assume that for some k ≤ |S|, a bestk clustering exists (one
could fail to exist because every prime clustering could have fewer than k clusters), and
that none of the (prime) clusterings {C1, C2, ..., Ct} is a bestk clustering. We then obtain a
contradiction.

Consider the sequence of clusterings C1^temp, C1, C2^temp, C2, ..., Ct^temp, Ct produced during
the run of algorithm A. Since C = C1^temp satisfies

(a) C has at least k clusters

(b) C is a specialization of a bestk clustering,

there is a rightmost element R of this sequence of clusterings that satisfies (a) and (b). There
are 3 cases, each resulting in a contradiction:

Case 1: For some i, 1 ≤ i < t, R = Ci. Then, since R is a prime clustering of S, by (a),
(b), our assumption that none of the Ci's is a bestk clustering, and Lemma 4.3, we
conclude that Ci+1^temp is a specialization of a bestk clustering. Then Ci+1^temp must have
fewer than k clusters, otherwise R = Ci+1^temp instead of Ci. Since Ci and Ci+1^temp differ
in number of clusters by exactly one, it follows that Ci has exactly k clusters. Let
Cbest be a (prime) generalization of Ci that is a bestk clustering. Then Cbest must have
exactly k clusters, each a superset of a different cluster of Ci. Thus Tn(Ci) ≥ Tn(Cbest),
Dn(Ci) ≥ Dn(Cbest), and Gn(Ci) ≥ Gn(Cbest), contradicting the assumption that Ci is
not a bestk clustering.

Case 2: R = Ct. Then since Ct has exactly one cluster, k = 1. In other words, Ct has
exactly k clusters, and the reasoning concluding case 1 above may be employed.

Case 3: For some i, 1 ≤ i ≤ t, R = Ci^temp. Since Ci^temp is a specialization of some bestk
clustering Cbest, so is Ci. Then Ci must have fewer than k clusters, otherwise R = Ci.
Both Ci and Cbest are prime clusterings of S, so Cbest can have at most the same number
of clusters as Ci (one superset of each c ∈ Ci). Therefore, Cbest has fewer than k clusters,
and is not a bestk clustering, a contradiction.

Since in each case we have arrived at a contradiction, our assumption that none of
the clusterings {C1, C2, ..., Ct} is a bestk clustering must be false, completing the proof of
Theorem 4.4. □

A  natural  question  is  whether  it  is  possible  to  find  a  best  clustering  with  exactly  k 
clusters.  Certainly  this  is  at  least  as  difficult  as  finding  a  best  clustering  with  at  most 
k  clusters,  since  an  algorithm  for  the  former  problem  could  be  run  k  times  to  find  best 
clusterings  with  exactly  1,  2,  3, .  .  .  ,  k  clusters,  and  the  best  could  be  chosen  as  an  answer  to 
the latter problem. We show that there exists a clustering problem satisfying P1 - P6 such
that  unless  P  =  NP,  no  polynomial  time  algorithm  is  guaranteed  to  find,  for  all  instances 
of  the  problem,  a  best  clustering  with  at  most  k  clusters. 

It would appear that a simple reduction from the NP-hard CLUSTERING problem
(Garey & Johnson, 1979) would be sufficient to show this. However, due to the definition of
what an “instance” of each problem is, a straightforward approach relating the “distance”
function of CLUSTERING to any of our measures T, D, or G will not work.

We sketch a proof that there is a clustering problem (X, L, I, T, D, G) satisfying P1
through P6 such that for each number k ≥ 3, the problem of finding for all instances
(Xn, Ln, In, Tn, Dn, Gn, S), a best clustering among those with at most k clusters, is
NP-hard. Our example is a cluster analysis problem, thus for each n, Ln = {c : c is a finite
subset of Xn}, and In is the identity function. As in all cluster analysis problems within our
framework, properties P1, P2, and P4 hold immediately.

Let k ≥ 3 be given, and let X, T, D, and G be as defined below:

X = {X2v}, where X2v is the set of (even) length 2v strings. A given string of length
2v will represent a vertex in an undirected graph of v vertices if the first half of the string
contains a single “1” bit. The single “1” among the first v bits indicates which vertex
it is, and the remaining v bits give adjacency information with other vertices, i.e., a “1”
in position v + j indicates that the vertex is adjacent to vertex j. Note that any graph
with v vertices may be represented by a finite set of points of X2v, although not every
finite subset of X2v represents a graph. For example, the graph of 5 vertices with edges
(1, 2), (1, 3), (1, 5), (2, 3), (4, 5), (3, 5) corresponds to the subset of X10 given by the elements
{x1, x2, x3, x4, x5} below (a comma is inserted between the 5th and 6th bits, and each
element is parenthesized to aid the interpretation):

x1 = (10000,01101)    (vertex 1 adjacent to 2, 3, and 5)
x2 = (01000,10100)    (vertex 2 adjacent to 1 and 3)
x3 = (00100,11001)    (vertex 3 adjacent to 1, 2, and 5)
x4 = (00010,00001)    (vertex 4 adjacent to 5)
x5 = (00001,10110)    (vertex 5 adjacent to 1, 3, and 4)


Given as input any even number 2v and finite subset S of X2v, it is decidable in polynomial
time whether S represents a subset of the vertices of some undirected graph of v vertices,
or whether no undirected graph has a subset of vertices represented by the elements of S.
(What must be checked is that (1) each string of S has a single “1” among the first v bits;
(2) for each i ≤ v, there is at most one string in S with a single “1” in position i; and
(3) if a string representing a vertex numbered i has a “1” in position v + j, then the string
representing vertex j (if it appears in S) has a “1” in position v + i.)

For a clustering C, let T2v(C) = 0 if the union of the clusters of C is not a finite
subset of X2v representing a subset of the vertices of some undirected graph, OR if there is
a cluster c ∈ C such that the representations of two vertices which are adjacent in the
represented subgraph are both contained in c. Let T2v(C) = 1 otherwise. In the example above,
T10({{x1, x4}, {x2, x5}}) = 1, since the elements of the clusters are consistent with some
5 vertex undirected graph, and no two adjacent vertices appear in any single cluster. On
the other hand, T10({{x1, x2}, {x3}}) = 0, since in any graph which contains the vertices
x1, x2, and x3, x1 is adjacent to x2 and they appear in the same cluster. As a final example,
T4({{1001}, {0100}}) = 0, because the adjacency information between vertex 1 and 2 in
the two vertex graph represented is inconsistent.

Let D2v be the constant function D2v(C) = 2v, and let G2v(C) = min(T2v(C), D2v(C)).
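The three consistency checks and the functions T2v, D2v, and G2v defined above can be sketched as follows. The sketch assumes each element of X2v is given as a Python string over the characters "0" and "1" and that v is passed explicitly; the function names are hypothetical, and vertices are indexed from 0 internally.

```python
from typing import Dict, Iterable, List, Set


def vertex_of(s: str, v: int) -> int:
    """0-based index of the vertex a string represents."""
    return s[:v].index("1")


def represents_vertices(strings: Iterable[str], v: int) -> bool:
    """Apply the three checks described in the text."""
    rows: Dict[int, str] = {}
    for s in set(strings):
        if len(s) != 2 * v or s[:v].count("1") != 1:   # check (1)
            return False
        i = vertex_of(s, v)
        if i in rows:                                   # check (2)
            return False
        rows[i] = s[v:]
    # check (3): adjacency must be symmetric among the vertices present
    return all(rows[j][i] == "1"
               for i, adj in rows.items() for j in rows if adj[j] == "1")


def tightness(C: List[Set[str]], v: int) -> int:
    """T_2v: 1 if consistent and no cluster holds two adjacent vertices, else 0."""
    if not represents_vertices(set().union(*C), v):
        return 0
    for cluster in C:
        present = {vertex_of(s, v) for s in cluster}
        for s in cluster:
            i = vertex_of(s, v)
            if any(s[v + j] == "1" for j in present if j != i):
                return 0
    return 1


def distance(C: List[Set[str]], v: int) -> int:
    """D_2v: the constant 2v."""
    return 2 * v


def goodness(C: List[Set[str]], v: int) -> int:
    """G_2v = min(T_2v, D_2v)."""
    return min(tightness(C, v), distance(C, v))
```

Applied to the three examples in the text, the sketch returns 1, 0, and 0 respectively.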

Now it is easily verified that the clustering problem (X, L, I, T, D, G) satisfies P3, P5 and
P6. Since this is a cluster analysis problem, it also satisfies P1, P2, and P4.

We reduce the NP-hard graph k-colorability problem (Garey & Johnson, 1979) to the
problem of finding a best solution among all clusterings having at most k clusters for the
problem (X, L, I, T, D, G). For each k ≥ 3, the graph k-colorability problem is to determine
whether there is a coloring of the vertices of a graph using at most k colors, so that no two
adjacent vertices have the same color. Given a graph A = (V, E), with v vertices, we form
an instance (X2v, L2v, I2v, T2v, D2v, G2v, S) of the clustering problem above by letting the set
S ⊆ X2v to be clustered be exactly those elements of X2v which represent vertices V of the
graph A with adjacency information given by E. A simple argument shows that the graph
A is k-colorable iff there is a clustering C for this instance with at most k clusters such that
G2v(C) = 1. (The clusters consist of representations of vertices to be colored with the same
color.) Otherwise, any clustering for this instance with at most k clusters has G2v(C) = 0.
It follows that for each k, any algorithm for solving (X, L, I, T, D, G) by finding the best
clustering with at most k clusters can be used to solve the graph k-colorability problem. We
have thus proved

Theorem 4.5 For all k ≥ 3 there are clustering problems (X, L, I, T, D, G) satisfying
properties P1 through P6 such that, unless P = NP, there does not exist a polynomial time
algorithm for finding a best clustering among all clusterings with at most k clusters for every
instance (Xn, Ln, In, Tn, Dn, Gn, S).
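The correspondence between colorings and clusterings used in the reduction can be sketched on the 5-vertex example graph. The function names and the representation of a coloring as a list of integers (one per vertex) are assumptions of this sketch; `goodness` is specialized to sets produced by `encode_graph`, which are always consistent, so only the adjacency condition needs checking.

```python
from typing import Dict, Iterable, List, Set, Tuple


def encode_graph(v: int, edges: Iterable[Tuple[int, int]]) -> List[str]:
    """One 2v-bit string per vertex (vertices numbered from 1)."""
    adj = [["0"] * v for _ in range(v)]
    for a, b in edges:
        adj[a - 1][b - 1] = adj[b - 1][a - 1] = "1"
    return ["0" * i + "1" + "0" * (v - i - 1) + "".join(adj[i]) for i in range(v)]


def clustering_from_coloring(strings: List[str], colors: List[int]) -> List[Set[str]]:
    """Group the vertex strings by color: one cluster per color used."""
    groups: Dict[int, Set[str]] = {}
    for s, color in zip(strings, colors):
        groups.setdefault(color, set()).add(s)
    return list(groups.values())


def goodness(C: List[Set[str]], v: int) -> int:
    """G_2v = min(T_2v, 2v) for clusterings of an encode_graph output."""
    for cluster in C:
        for s in cluster:
            for t in cluster:
                # s is adjacent to the vertex that t represents
                if s != t and s[v + t[:v].index("1")] == "1":
                    return 0
    return min(1, 2 * v)
```

With the example edges, the proper 3-coloring `[0, 1, 2, 0, 1]` yields the clusters {x1, x4}, {x2, x5}, {x3} with goodness 1, while the improper 2-coloring `[0, 1, 0, 1, 0]` puts the adjacent vertices 1 and 3 together and yields goodness 0.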

4.3  Variants  of  T  and  D 

Although Theorem 4.1 shows that only very general assumptions on the functions G are
needed, the results apply only when the functions T satisfy property P5, and the functions
D satisfy property P3. Measures of tightness such as “density” of individual clusters allow
the tightness to increase as a clustering is generalized, since “sparse” clusters may become
“dense” when new points are added. Thus property P5 is violated for this type of measure.
Similarly, if the distance functions {Dn} are defined as the maximum intercluster distance
dn, property P3 is no longer satisfied. In this section we show that (assuming P ≠ NP), P5
is necessary in the sense that properties P1, P2, P3, P4, and P6 alone are not sufficient for a
clustering problem to be solvable in polynomial time. We conclude by observing that if the
functions {Dn} are in fact the maximum intercluster distance, then the clustering problem
is trivial (assuming P5 still holds).

Theorem 4.6 There is a clustering problem (X, L, I, T, D, G) satisfying P1, P2, P3, P4, and
P6 that is not solvable in polynomial time unless P = NP.

Proof:  We  need  only  reduce  INDEPENDENT  SET,  an  NP-hard  problem  (Garey  fc 

Johnson,  1979),  to  a  cluster  analysis  problem  satisfying  properties  P3  and  P6.  An  instance 
of  INDEPENDENT  SET  is  a  graph  A  =  (V,  E ),  and  a  positive  integer  k  <  \V\.  The  problem 
is  to  determine  if  A  contains  an  independent  set  of  size  k  or  more,  i.e.,  a  subset  V'  C  V  such 
that  \V'\  >  k  and  such  that  no  two  vertices  of  V'  are  joined  by  an  edge  in  E. 

Let the clustering problem (X, L, I, T, D, G) be defined as in the proof of Theorem 4.5,
except that T2v has the following modified definition: T2v(C) = 0 if the union of the clusters
in C is not a finite subset of X2v representing all of the vertices of some undirected graph
with v vertices, or if there is a cluster c ∈ C such that the representations of two vertices which
are adjacent in the represented graph are both contained in c. Otherwise, T2v(C) = the number of
elements in the largest cluster of C.

(X, L, I, T, D, G) satisfies properties P1, P2, P3, P4, and P6, since only the definition of T
has been changed from the proof of Theorem 4.5, and property P5 has been dropped. Also
note that for each C, D2v(C) = 2v > T2v(C), so G2v(C) = T2v(C). It is now easily shown
that a graph A has an independent set of size k iff the instance (X2v, L2v, I2v, T2v, D2v, G2v, S),
with S representing the graph A, has a solution C with G2v(C) = k. □
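
The reduction can be checked by brute force on very small graphs. The following Python sketch is an illustration only, under simplifying assumptions: vertices are used directly as points (the encoding into X2v from Theorem 4.5 is not modeled), every clustering considered is a partition of the full vertex set, and the function names (tightness, partitions, best_tightness, max_independent_set_size) are hypothetical.

```python
from itertools import combinations


def tightness(clustering, edges):
    """Modified tightness: 0 if some cluster contains two adjacent
    vertices, otherwise the size of the largest cluster."""
    edge_set = {frozenset(e) for e in edges}
    for cluster in clustering:
        for u, v in combinations(cluster, 2):
            if frozenset((u, v)) in edge_set:
                return 0
    return max(len(cluster) for cluster in clustering)


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Either `first` forms its own block ...
        yield [[first]] + part
        # ... or it joins one of the existing blocks.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def best_tightness(vertices, edges):
    """Largest tightness over all partitions (exponential time)."""
    return max(tightness(p, edges) for p in partitions(list(vertices)))


def max_independent_set_size(vertices, edges):
    """Size of a largest independent set, by exhaustive search."""
    edge_set = {frozenset(e) for e in edges}
    best = 0
    for r in range(1, len(vertices) + 1):
        for subset in combinations(vertices, r):
            if all(frozenset(p) not in edge_set
                   for p in combinations(subset, 2)):
                best = r
    return best
```

On a nonempty graph the two quantities agree: a clustering with nonzero tightness has only independent clusters, and an independent set together with singleton clusters for the remaining vertices is a clustering whose tightness is the size of that set.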

Finally, suppose that we modify part (a) of property P3 so that Dn is now the maximum
inter-cluster distance dn(c, c') among all clusters c, c' ∈ C, where dn satisfies part (b) of
property P3. Then any clustering problem (X, L, I, T, D, G) satisfying properties P2, P5, P6, and
this modified definition of P3 is trivially solvable: the clustering given by C1 in algorithm
A must be a best clustering, since it is a specialization of every best clustering; tightness cannot
increase under generalization, nor can distance, by the modified property P3. Thus Gn cannot
increase either.


5  Conclusion 

The  main  results  in  this  paper  can  be  summarized  as  follows: 

• The agglomerative algorithm will find a best (conceptual) clustering of a set of points
if the similarity measure (for clusterings) and the dissimilarity (between clusters) are
monotone with respect to generalization, the objective function is monotone with
respect to similarity and dissimilarity, and the language is tractable. Tractable in this
case means that the clusterings of the language are not too large, that it is possible to
efficiently determine whether a point is in a cluster, and that the most specific clustering
in the language satisfying certain conditions exists and can be found.
The “identity” language for cluster analysis trivially has these properties.

•  Under  these  same  conditions,  the  agglomerative  algorithm  will  find  a  best  clustering 
with  at  least  k  clusters  for  any  fixed  k  less  than  the  size  of  the  sample  set  being 
clustered. 

•  Finding  the  best  clustering  with  at  most  k  clusters  is  NP-hard  under  these  conditions. 

• If the measure of similarity is not monotone with respect to generalization, then
finding an optimal clustering is NP-hard, even if the other monotone properties and the
language properties are satisfied.
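
As an illustration of the first two results, the sketch below shows a generic agglomerative procedure for the “identity” language. It is not the paper's algorithm A as specified; it assumes distinct points, uses the minimum point-to-point distance between clusters as the dissimilarity, and takes the objective as a caller-supplied function. The names agglomerate, single_link, abs_dist, and example_objective are hypothetical, and the example objective is chosen only for demonstration, without any claim that it satisfies the properties above.

```python
from itertools import combinations


def single_link(c1, c2, dist):
    """Minimum point-to-point distance between two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerate(points, dist, objective, min_clusters=1):
    """Repeatedly merge the two closest clusters and return the
    highest-scoring clustering seen that has at least `min_clusters`
    clusters, together with its score."""
    clustering = [frozenset([p]) for p in points]  # most specific clustering
    best, best_score = clustering, objective(clustering)
    while len(clustering) > max(min_clusters, 1):
        a, b = min(combinations(clustering, 2),
                   key=lambda pair: single_link(pair[0], pair[1], dist))
        # Generalize: replace the closest pair by its union.
        clustering = [c for c in clustering if c not in (a, b)] + [a | b]
        score = objective(clustering)
        if score > best_score:
            best, best_score = clustering, score
    return best, best_score


def abs_dist(p, q):
    """Distance between two points on the real line."""
    return abs(p - q)


def example_objective(clustering):
    """Hypothetical objective: separation between clusters minus the
    largest cluster diameter (illustration only)."""
    diameter = max(max(abs_dist(p, q) for p in c for q in c)
                   for c in clustering)
    if len(clustering) < 2:
        separation = 0
    else:
        separation = min(single_link(a, b, abs_dist)
                         for a, b in combinations(clustering, 2))
    return separation - diameter


points = [0, 1, 2, 10, 11, 12]
best, score = agglomerate(points, abs_dist, example_objective)
best_k3, score_k3 = agglomerate(points, abs_dist, example_objective,
                                min_clusters=3)
```

The procedure examines only the clusterings along one merge sequence, at most one per cluster count, which is what keeps the running time polynomial. The min_clusters parameter corresponds to the “at least k clusters” variant: merging simply stops once k clusters remain.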

These results have several interesting implications. First, the agglomerative algorithm
is  more  widely  applicable  than  would  be  expected  from  such  a  simple  technique.  Under 
straightforward  and  intuitively  natural  conditions  on  the  metric,  the  objective  function,  and 
the  cluster  description  language,  it  finds  a  best  clustering  in  polynomial  time.  The  language 
restrictions  are  satisfied  by  conjunctive,  attribute-based  languages,  including  those  using 
internal  disjunction.  They  also  apply  to  several  interesting  geometric  languages.  They  do 
not  hold  for  the  existentially-quantified  conjunctive  predicate  calculus  statements  sometimes 
used  to  represent  structured  objects  (Stepp,  1987a;  Larson,  1977). 
Finally, as would be expected, it seems that finding a best clustering with a given number
of  clusters  is  hard.  The  implication  is  that  clustering  algorithms  which  try  to  find  a  best 
clustering  of  a  certain  size  will  have  to  be  content  with  sub-optimal  results.  It  also  confirms 
the  intuition  that  heuristic  techniques  and  domain  knowledge  are  probably  necessary  to 
produce  good  solutions. 

We  would  like  to  extend  the  results  to  metrics  that,  for  example,  include  notions  such 
as  density  or  average  similarity  over  clusters.  Additionally,  it  would  be  useful  to  be  able  to 
weaken the restrictions on distance (for a clustering) so that it is not the minimum
inter-cluster distance.

A problem we have not addressed here is the notion of predictive clustering, along the lines of
learnability as described by Valiant (1984) and Blumer, Ehrenfeucht, Haussler, & Warmuth
(1986). (See also Kearns, Li, Pitt, & Valiant, 1987.) The idea is to develop a clustering
which  is  “good”  for  an  entire  space  X  (under  an  unspecified  probability  distribution),  given 
only  randomly  generated  points  from  X.  We  have  definitions  that  seem  suitable  for  this 
problem,  and  have  some  preliminary  results  indicating  that  this  is  a  significantly  harder 
problem  than  nonpredictive  clustering.  These  results  may  be  presented  in  a  future  paper. 